1 Introduction

This evaluation covers six datasets. We evaluated them with the metrics implemented in the OmicsEV package. The table below shows the number of samples in each class for each dataset.

class d1 d2 d3 d4 d5 d6
Basal 17 17 17 17 17 17
Her2 12 12 12 12 12 12
LumA 19 19 19 19 19 19
LumB 22 22 22 22 22 22
None 16 16 16 16 16 16

The detailed sample information is shown below.

sample class batch order
TCGA.A2.A0CM Basal 1 1
TCGA.A2.A0D0 Basal 1 2
TCGA.A2.A0D1 None 1 3
TCGA.A2.A0D2 Basal 1 4
TCGA.A2.A0EQ Her2 1 5
TCGA.A2.A0EV LumA 1 6
TCGA.A2.A0EX LumA 1 7
TCGA.A2.A0EY LumB 1 8
TCGA.A2.A0SW LumB 1 9
TCGA.A2.A0SX Basal 1 10
TCGA.A2.A0T1 Her2 1 11
TCGA.A2.A0T2 Basal 1 12
TCGA.A2.A0T6 LumA 1 13
TCGA.A2.A0T7 LumA 1 14
TCGA.A2.A0YC LumA 1 15
TCGA.A2.A0YD LumA 1 16
TCGA.A2.A0YF LumA 1 17
TCGA.A2.A0YG LumB 1 18
TCGA.A2.A0YI LumA 1 19
TCGA.A2.A0YL LumA 1 20
TCGA.A2.A0YM Basal 1 21
TCGA.A7.A0CD LumA 1 22
TCGA.A7.A0CE Basal 1 23
TCGA.A7.A0CJ LumB 1 24
TCGA.A8.A06N LumB 1 25
TCGA.A8.A06Z LumB 1 26
TCGA.A8.A076 LumB 1 27
TCGA.A8.A079 LumB 1 28
TCGA.A8.A09G Her2 1 29
TCGA.A8.A09I LumB 1 30
TCGA.AN.A04A None 1 31
TCGA.AN.A0AJ LumB 1 32
TCGA.AN.A0AL Basal 1 33
TCGA.AN.A0AM LumB 1 34
TCGA.AN.A0AS LumA 1 35
TCGA.AN.A0FK LumA 1 36
TCGA.AN.A0FL Basal 1 37
TCGA.AO.A03O None 1 38
TCGA.AO.A0J6 None 1 39
TCGA.AO.A0J9 None 1 40
TCGA.AO.A0JC None 1 41
TCGA.AO.A0JE None 1 42
TCGA.AO.A0JJ None 1 43
TCGA.AO.A0JL None 1 44
TCGA.AO.A0JM None 1 45
TCGA.AO.A126 None 1 46
TCGA.AO.A12B None 1 47
TCGA.AO.A12E None 1 48
TCGA.AR.A0TR LumA 1 49
TCGA.AR.A0TT LumB 1 50
TCGA.AR.A0TV LumB 1 51
TCGA.AR.A0TX Her2 1 52
TCGA.AR.A0U4 None 1 53
TCGA.BH.A0EE Her2 1 54
TCGA.BH.A0HP LumA 1 55
TCGA.A2.A0T3 LumB 2 56
TCGA.A7.A13F LumB 2 57
TCGA.AO.A12D None 2 58
TCGA.AO.A12F None 2 59
TCGA.AR.A0TY LumB 2 60
TCGA.AR.A1AQ Basal 2 61
TCGA.AR.A1AV LumA 2 62
TCGA.AR.A1AW LumB 2 63
TCGA.BH.A0AV Basal 2 64
TCGA.BH.A0C1 LumA 2 65
TCGA.BH.A0C7 LumB 2 66
TCGA.BH.A0E9 LumA 2 67
TCGA.C8.A12L Her2 2 68
TCGA.C8.A12P Her2 2 69
TCGA.C8.A12Q Her2 2 70
TCGA.C8.A12T Her2 2 71
TCGA.C8.A12U LumB 2 72
TCGA.C8.A12V Basal 2 73
TCGA.C8.A12W LumB 2 74
TCGA.C8.A12Z Her2 2 75
TCGA.C8.A130 LumB 2 76
TCGA.C8.A131 Basal 2 77
TCGA.C8.A134 Basal 2 78
TCGA.C8.A135 Her2 2 79
TCGA.C8.A138 Her2 2 80
TCGA.D8.A13Y LumB 2 81
TCGA.D8.A142 Basal 2 82
TCGA.E2.A10A LumA 2 83
TCGA.E2.A150 Basal 2 84
TCGA.E2.A154 LumA 2 85
TCGA.E2.A159 Basal 2 86

2 Descriptive

2.1 Protein/gene identification and quantification

The table below shows the number of identified proteins or genes for each dataset. Proteins or genes that pass the 50% missing-value filter are counted as quantified.

dataSet # proteins (genes) # proteins (genes) [50%]
d1 20501 18694
d2 20501 18717
d3 20501 18694
d4 20501 18694
d5 20501 18694
d6 20501 18694
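The 50% missing-value filter described above can be sketched as follows. This is a minimal illustration rather than the OmicsEV implementation: the function name `quantified_features`, the toy gene names, and the choice to keep a feature when exactly 50% of its values are missing are all assumptions.

```python
import math

def quantified_features(matrix, max_missing=0.5):
    """Return the features whose fraction of missing values is <= max_missing.

    matrix maps a feature name to its per-sample values; None or NaN marks
    a missing value. The inclusive threshold is an assumption.
    """
    kept = []
    for name, values in matrix.items():
        n_missing = sum(
            1 for v in values
            if v is None or (isinstance(v, float) and math.isnan(v))
        )
        if n_missing / len(values) <= max_missing:
            kept.append(name)
    return kept

# Hypothetical toy data: three genes measured in four samples.
toy = {
    "geneA": [1.2, 3.4, 2.2, 0.9],     # 0% missing
    "geneB": [None, 2.0, None, 1.1],   # 50% missing
    "geneC": [None, None, None, 4.0],  # 75% missing
}
print(quantified_features(toy))  # ['geneA', 'geneB']
```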

The UpSet chart below shows the overlap in proteins or genes identified across datasets. The top bar chart gives the number of identified proteins or genes shared between datasets, and the solid points below the bar chart mark the datasets in each set. The total number of identifications for each dataset is shown on the left as ‘Set size’.

2.2 Protein/gene number distribution

The figures below show the number of proteins or genes identified in each sample. The samples from different batches are coded in different shapes and the samples from different classes are coded in different colors.

[Figure panels: d1, d2, d3, d4, d5, d6]

3 Normalization and batch effect

3.1 Protein or gene expression distribution

The boxplots show the distribution of protein or gene expression across samples. The x axis lists the samples in input order. The y axis is log2-transformed protein or gene expression. Samples from different classes are shown in different colors.

[Figure panels: d1, d2, d3, d4, d5, d6]

The density plots show the distribution of protein or gene expression across samples. The x axis is log2-transformed protein or gene expression. The y axis is density.

3.2 Batch effect (Heatmap ordered by batches)

In these figures, each row and each column is a sample, and the color indicates the correlation between samples. The samples are ordered by batch.

[Figure panels: d1, d2, d3, d4, d5, d6]

3.3 Batch effect evaluation using kBET

In this section, we use the k-nearest neighbour batch effect test (kBET) to quantify batch effects. First, the algorithm builds a k-nearest neighbour matrix and chooses 10% of the samples, checking the batch label distribution in the neighbourhood of each chosen sample. If the local batch label distribution is sufficiently similar to the global batch label distribution, the \(\chi^2\)-test does not reject the null hypothesis (that is, “all batches are well-mixed”). Finally, the result of kBET is the average test rejection rate. The lower the rejection rate, the less bias is introduced by the batch effect.

dataSet kBET.expected kBET.observed kBET.signif
d1 0.013 0.000 1.000
d2 0.006 0.000 1.000
d3 0.007 0.000 1.000
d4 0.004 0.000 1.000
d5 0.000 0.009 0.873
d6 0.000 0.000 1.000
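The kBET procedure described above can be sketched in a few lines. This is a simplified illustration and not the kBET package itself: it assumes exactly two batches labelled 0 and 1 (so the \(\chi^2\)-test has one degree of freedom and its p-value reduces to `erfc`), Euclidean distance, and neighbourhoods that exclude the tested sample. The function names and toy data are hypothetical.

```python
import math
import random

def chi2_pvalue_df1(stat):
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

def kbet_rejection_rate(points, batches, k=6, frac=0.1, alpha=0.05, seed=0):
    """Fraction of tested samples whose local batch mix differs from the global mix.

    Assumes two batches (labels 0 and 1), both present in the data.
    """
    n = len(points)
    p1 = sum(batches) / n  # global frequency of batch 1
    rng = random.Random(seed)
    tested = rng.sample(range(n), max(1, round(frac * n)))
    rejected = 0
    for i in tested:
        # k nearest neighbours of sample i, excluding i itself
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbours = others[:k]
        obs1 = sum(batches[j] for j in neighbours)
        obs0 = k - obs1
        exp1, exp0 = k * p1, k * (1 - p1)
        stat = (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
        if chi2_pvalue_df1(stat) < alpha:
            rejected += 1
    return rejected / len(tested)

# Well-mixed toy data: batches alternate along a line.
mixed_pts = [(float(i), 0.0) for i in range(40)]
mixed_batches = [i % 2 for i in range(40)]

# Separated toy data: each batch forms its own distant group.
split_pts = [(i * 0.1, 0.0) for i in range(20)] + [(100 + i * 0.1, 0.0) for i in range(20)]
split_batches = [0] * 20 + [1] * 20

print(kbet_rejection_rate(mixed_pts, mixed_batches))  # 0.0
print(kbet_rejection_rate(split_pts, split_batches))  # 1.0
```

In this sketch a rejection rate near 0 corresponds to well-mixed batches and a rate near 1 to a strong batch effect, matching the interpretation of the observed rejection rate in the table above.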

3.4 Batch effect evaluation using silhouette width

The silhouette width \(s(i)\) ranges from \(-1\) to \(1\), with \(s(i) \to 1\) if two clusters are separate and \(s(i) \to -1\) if two clusters overlap but have dissimilar variance. If \(s(i) \to 0\), both clusters have roughly the same structure. Thus, we use the absolute value \(|s|\) as an indicator of the presence or absence of batch effects.

dataSet silhouette_width
d1 0.014
d2 0.000
d3 0.009
d4 0.014
d5 0.020
d6 0.021
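As an illustration of how a silhouette width can be computed with batches acting as the clusters, here is a minimal sketch. It is not the code used to produce the table above: the function name `mean_silhouette`, the Euclidean distance, and the toy coordinates are assumptions, and each batch is assumed to contain at least two samples.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width of samples grouped by label (here: batch)."""
    groups = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # Mean distance from sample i to the other members of each group.
        mean_dist = {}
        for g in groups:
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == g and j != i]
            mean_dist[g] = sum(d) / len(d)
        a = mean_dist[labels[i]]                                     # own batch
        b = min(v for g, v in mean_dist.items() if g != labels[i])   # nearest other batch
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)

# Separated batches: the silhouette width is close to 1.
split_pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
split_batches = [1, 1, 2, 2]

# Interleaved batches: the absolute silhouette width is much smaller.
mixed_pts = [(float(i), 0.0) for i in range(8)]
mixed_batches = [1, 2, 1, 2, 1, 2, 1, 2]

print(round(mean_silhouette(split_pts, split_batches), 3))
print(round(abs(mean_silhouette(mixed_pts, mixed_batches)), 3))
```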

3.5 Batch effect evaluation based on principal components

For each PC, we calculate Pearson’s correlation coefficient with batch covariate b:

\[ r_i = \mathrm{corr}(PC_i, b) \]

In a linear model with a single predictor, as is the case here for each PC regressed on the batch covariate, the coefficient of determination \(R^2\) is the squared Pearson’s correlation coefficient:

\[ R^2(PC_i, b) = r_i^2 \]

We then estimate the significance of the correlation coefficient with either a t-test or a one-way ANOVA. \(R^2\) values highlighted in red are significant (p-value <= 0.05).

PC d1 d2 d3 d4 d5 d6
PC1 0.019 0 0.024 0.027 0.015 0.015
PC2 0.024 0.014 0.026 0.025 0.021 0.016
PC3 0.012 0.014 0.009 0.011 0.015 0.009
PC4 0.008 0.004 0 0 0.018 0.012
PC5 0.043 0.044 0.045 0.042 0.029 0
PC6 0.004 0.027 0.011 0.015 0.002 0.001
PC7 0.002 0.003 0.008 0.006 0 0.002
PC8 0.017 0.003 0.007 0.009 0.029 0.003
PC9 0.009 0.001 0.021 0.019 0.001 0.025
PC10 0.001 0.015 0.001 0.001 0.006 0.004
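The relationship between a PC and the batch covariate can be illustrated with a short sketch. It assumes the PC scores are already available and that batch is coded numerically. The function names and toy values are hypothetical, and only the t statistic (with n − 2 degrees of freedom) is computed, not the p-value.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r2_and_t(pc_scores, batch):
    """Return R^2 = r^2 and the t statistic for testing r = 0.

    Assumes |r| < 1, otherwise the t statistic is undefined.
    """
    r = pearson(pc_scores, batch)
    n = len(batch)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r * r, t

# Hypothetical scores of one PC for six samples in two batches.
pc_scores = [0.8, 1.1, 0.9, -1.0, -0.7, -1.1]
batch = [1, 1, 1, 2, 2, 2]

r2, t = r2_and_t(pc_scores, batch)
print(round(r2, 3), round(t, 2))
```

With a large \(R^2\) such as in this toy example, the PC is mostly explained by batch. The values in the table above are all small by comparison.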

3.6 Batch effect evaluation using pca score plot

In these figures, each point is a sample plotted by its scores on the principal components.

3.7 Protein or gene coefficient of variation (CV) distribution

[Figure panels: d1, d2, d3, d4, d5, d6]

3.8 Missing value distribution

The missing value distribution gives an overview of the percentage of missing values across all proteins or genes in both the QC and experiment samples.

[Figure panels: d1, d2, d3, d4, d5, d6]

4 Unsupervised analysis of samples

4.1 PCA

[Figure panels: d1, d2, d3, d4, d5, d6]

4.2 Cluster analysis

[Figure panels: d1, d2, d3, d4, d5, d6]

5 Correlation between proteins

5.1 Within vs between protein complexes

The table below summarizes the evaluation. “diff” is Cor(intra) - Cor(inter), and “ks” is the Kolmogorov-Smirnov test statistic.

dataSet InterComplex IntraComplex diff ks
d1 0.034 0.160 0.127 0.235
d2 0.003 0.105 0.102 0.208
d3 0.017 0.177 0.160 0.282
d4 0.011 0.165 0.155 0.275
d5 0.064 0.178 0.115 0.207
d6 0.032 0.159 0.127 0.238
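The two summary columns “diff” and “ks” can be illustrated with a small sketch. It assumes that pairwise correlations have already been computed for protein pairs within the same complex (intra) and in different complexes (inter), and that each group is summarized by its median. The use of the median, the function name, and the toy values are assumptions.

```python
from statistics import median

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    d = 0.0
    for x in list(a) + list(b):
        fa = sum(1 for v in a if v <= x) / len(a)
        fb = sum(1 for v in b if v <= x) / len(b)
        d = max(d, abs(fa - fb))
    return d

# Hypothetical pairwise correlations.
intra = [0.2, 0.3, 0.1, 0.4]    # protein pairs in the same complex
inter = [0.0, 0.05, -0.1, 0.1]  # protein pairs in different complexes

diff = median(intra) - median(inter)
ks = ks_statistic(intra, inter)
print(round(diff, 3), round(ks, 3))  # 0.225 0.75
```

A larger “diff” and a larger “ks” both indicate that correlations within complexes are more clearly separated from correlations between complexes.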

6 Correlation between mRNA and protein

6.1 Gene-wise correlation

dataSet n n5 n6 n7 n8 median_cor
d1 9129 1911 773 210 21 0.329
d2 9131 1989 837 222 22 0.335
d3 9129 1993 823 223 24 0.335
d4 9129 2006 837 225 24 0.337
d5 9129 1764 693 185 20 0.321
d6 9129 1931 763 207 20 0.328

6.2 Sample-wise correlation

dataSet median_cor
d1 0.142
d2 0.142
d3 0.142
d4 0.142
d5 0.142
d6 0.138

7 Phenotype prediction

A prediction model was built for the classes LumA and LumB.

dataSet Variables ROC Sens Spec
d1 18694 0.993 0.947 0.955
d2 18717 0.994 0.947 0.955
d3 18694 0.993 0.947 0.909
d4 18694 0.992 0.947 0.864
d5 18694 0.993 0.947 1.000
d6 18694 0.996 0.947 1.000

8 Co-expression network based function prediction

In this evaluation, each dataset was used to build a co-expression network. For a selected network and a selected function term (such as a GO or KEGG term), the proteins/genes annotated to the term and also included in the network were defined as the positive protein/gene set, and the other proteins/genes in the network constituted the negative protein/gene set for the term. For a selected function term, we use some of its proteins/genes as seeds and then use a random walk algorithm to calculate scores for the other proteins/genes. A higher score for a protein/gene represents a closer relationship between that protein/gene and the seed proteins/genes. Finally, for each selected function term, we calculate an AUROC to evaluate the prediction performance.

term d1 d2 d3 d4 d5 d6
Allograft rejection 0.951 0.974 0.99 0.989 0.952 0.954
Aminoacyl-tRNA biosynthesis 0.81 0.812 0.78 0.777 0.805 0.829
Antigen processing and presentation 0.873 0.814 0.859 0.861 0.848 0.821
Asthma 0.977 0.946 0.951 0.956 0.976 0.952
Autoimmune thyroid disease 0.921 0.95 0.951 0.95 0.916 0.916
Cell adhesion molecules (CAMs) 0.78 0.803 0.817 0.805 0.833 0.815
Citrate cycle (TCA cycle) 0.776 0.601 0.806 0.766 0.692 0.735
Complement and coagulation cascades 0.826 0.826 0.88 0.852 0.846 0.815
DNA replication 0.897 0.898 0.88 0.893 0.898 0.894
ECM-receptor interaction 0.872 0.831 0.851 0.848 0.841 0.842
Fatty acid biosynthesis 0 0 0.805 0.694 0 0
Glycosphingolipid biosynthesis - lacto and neolacto series 0.799 0.743 0.708 0.661 0.738 0.813
Graft-versus-host disease 0.955 0.967 0.984 0.984 0.947 0.946
Hematopoietic cell lineage 0.795 0.775 0.794 0.793 0.81 0.775
Homologous recombination 0.85 0.729 0.766 0.747 0.812 0.782
Intestinal immune network for IgA production 0.83 0.921 0.875 0.897 0.79 0.83
Leishmaniasis 0.776 0.782 0.799 0.812 0.768 0.789
Malaria 0.869 0.834 0.876 0.862 0.872 0.849
Metabolism of xenobiotics by cytochrome P450 0.764 0.749 0.776 0.842 0.82 0.706
Mismatch repair 0.821 0.819 0.834 0.825 0.815 0.842
Oxidative phosphorylation 0.825 0.755 0.86 0.85 0.821 0.837
Parkinsons disease 0.807 0.732 0.808 0.818 0.783 0.811
Primary immunodeficiency 0.86 0.846 0.858 0.842 0.851 0.853
Proteasome 0.882 0.815 0.913 0.912 0.893 0.863
Protein export 0.79 0.782 0.875 0.855 0.781 0.827
Retinol metabolism 0.871 0.711 0.79 0.785 0.854 0.853
Ribosome 0.945 0.885 0.948 0.956 0.937 0.948
Spliceosome 0.811 0.799 0.822 0.836 0.775 0.808
Staphylococcus aureus infection 0.92 0.926 0.94 0.928 0.903 0.912
Steroid hormone biosynthesis 0.791 0.789 0.753 0.796 0.883 0.758
Systemic lupus erythematosus 0.849 0.863 0.847 0.86 0.844 0.835
Terpenoid backbone biosynthesis 0.72 0.655 0.809 0.828 0.733 0.731
Type I diabetes mellitus 0.847 0.865 0.851 0.869 0.865 0.837
Viral myocarditis 0.791 0.758 0.844 0.845 0.76 0.74
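The seed-based scoring and AUROC calculation described at the start of this section can be sketched as follows. This is an illustrative random walk with restart on a toy network, not the OmicsEV implementation: the restart probability, the number of iterations, the network, the function names, and the choice of seed are all assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, n_iter=100):
    """Score every node by its steady-state visiting probability.

    adj maps node -> {neighbour: weight} for an undirected network.
    The walker restarts at a seed with probability `restart` at each step.
    """
    nodes = list(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    degree = {v: sum(adj[v].values()) for v in nodes}
    p = dict(p0)
    for _ in range(n_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            if degree[u] == 0:
                continue  # isolated node: nothing to propagate
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * p[u] * w / degree[u]
        p = nxt
    return p

def auroc(scores, positives, negatives):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    wins = 0.0
    for i in positives:
        for j in negatives:
            if scores[i] > scores[j]:
                wins += 1.0
            elif scores[i] == scores[j]:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy network: two triangles (A-B-C and D-E-F) joined by the edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
adj = {v: {} for v in "ABCDEF"}
for u, v in edges:
    adj[u][v] = 1.0
    adj[v][u] = 1.0

# Hypothetical term annotating A, B and C; A is the seed, B and C are held out.
scores = random_walk_with_restart(adj, seeds={"A"})
print(auroc(scores, positives=["B", "C"], negatives=["D", "E", "F"]))  # 1.0
```

In this toy case the held-out positives sit in the same tightly connected module as the seed, so they outscore every negative and the AUROC is 1. An AUROC near 0.5 would mean the network carries no information about the term.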